Abstract

To facilitate future research in unsupervised induction of syntactic
structure and to standardize best practices, we propose a tagset that
consists of twelve universal part-of-speech categories. In addition to
the tagset, we develop a mapping from 25 different treebank tagsets to
this universal set. As a result, when combined with the original
treebank data, this universal tagset and mapping produce a dataset
consisting of common parts-of-speech for 22 different languages. We
highlight the use of this resource via two experiments, including one
that reports competitive accuracies for unsupervised grammar induction
without gold standard part-of-speech tags.



1 Introduction 

Part-of-speech (POS) tagging has received a great deal of attention as
it is a critical component of most natural language processing systems.
As supervised POS tagging accuracies for English (measured on the Wall
Street Journal portion of the PennTreebank (Marcus et al., 1993)) have
converged to around 97.3% (Toutanova et al., 2003; Shen et al., 2007),
attention has shifted to unsupervised approaches (Christodoulopoulos et
al., 2010). In particular, there has been growing interest in both
multi-lingual POS induction (Snyder et al., 2009; Naseem et al., 2009)
and cross-lingual POS induction via treebank projection (Yarowsky and
Ngai, 2001; Xi and Hwa, 2005; Das and Petrov, 2011).

Underlying these studies is the idea that a set of (coarse) syntactic
POS categories exists in similar forms across languages. These
categories are often called universals to represent their cross-lingual
nature (Carnie, 2002; Newmeyer, 2005). For example, Naseem et al. (2009)
used the Multext-East (Erjavec, 2004) corpus to evaluate their
multi-lingual POS induction system, because it uses the same tagset for
multiple languages. When corpora with common tagsets are unavailable, a
standard approach is to manually define a mapping from language and
treebank specific fine-grained tagsets to a predefined universal set.
This was the approach taken by Das and Petrov (2011) to evaluate their
cross-lingual POS projection system for six different languages.

To facilitate future research and to standardize best practices, we
propose a tagset that consists of twelve universal POS categories. While
there might be some controversy about what the exact tagset should be,
we feel that these twelve categories cover the most frequent
parts-of-speech that exist in most languages. In addition to the tagset,
we also develop a mapping from fine-grained POS tags for 25 different
treebanks to this universal set. As a result, when combined with the
original treebank data, this universal tagset and mapping produce a
dataset consisting of common parts-of-speech for 22 different
languages.^1 Both the tagset and mappings are made available for
download at http://code.google.com/p/universal-pos-tags/.

This resource serves multiple purposes. First, as mentioned previously,
it is useful for building and evaluating unsupervised and cross-lingual
taggers. Second, it permits a more reasonable comparison of accuracy
across languages for supervised taggers. Statements of the form "POS
tagging for language X is harder than for language Y" are vacuous when
the tagsets used for the two languages are incomparable (not to mention
of different cardinality). Finally, it permits language technology
practitioners to train POS taggers with common tagsets across multiple
languages. This in turn facilitates downstream application development,
as there is no need to maintain language specific rules due to
differences in treebank annotation guidelines.

^1 We include mappings for two different Chinese, German and Japanese
treebanks.

sentence:   The  oboist  Heinz  Holliger  has   taken  a    hard  line  about  the  problems
original:   DT   NN      NNP    NNP       VBZ   VBN    DT   JJ    NN    IN     DT   NNS
universal:  DET  NOUN    NOUN   NOUN      VERB  VERB   DET  ADJ   NOUN  ADP    DET  NOUN

Figure 1: Example English sentence with its language specific and
corresponding universal POS tags.

In this paper, we specifically highlight two use cases of this resource.
First, using our universal tagset and mapping, we run an experiment
comparing POS tag accuracies for 25 different treebanks to evaluate POS
tagging accuracy on a single tagset. Second, we combine the
cross-lingual projection part-of-speech taggers of Das and Petrov (2011)
with the grammar induction system of Naseem et al. (2010), which
requires a universal tagset, to produce a completely unsupervised
grammar induction system for multiple languages that does not require
gold POS tags in the target language.

2 Tagset 

While there might be some disagreement about the exact definition of a
universal POS tagset (Evans and Levinson, 2009), it seems fairly
indisputable that a set of coarse POS categories (or syntactic
universals) exists across all languages in one form or another (Carnie,
2002; Newmeyer, 2005). Rather than arguing over definitions, we took a
pragmatic standpoint during the design of the universal POS tagset and
focused our attention on the POS categories that we expect to be most
useful (and necessary) for users of POS taggers. In our opinion, these
are NLP practitioners using POS taggers in downstream applications, and
NLP researchers using POS taggers in grammar induction and other
experiments.

A high-level analysis of the tagsets underlying various treebanks shows
that the majority of tagsets are very fine-grained and language
specific. In fact, Smith and Eisner (2005) made a similar observation
and defined a collapsed set of 17 English POS tags (instead of the
original 45) that has subsequently been adopted by most unsupervised POS
induction work. Similarly, the organizers of the CoNLL shared tasks on
dependency parsing provide coarse (but still language specific) tags in
addition to the fine-grained tags used in the original treebanks
(Buchholz and Marsi, 2006; Nivre et al., 2007). McDonald and Nivre
(2007) identified eight different coarse POS tags when analyzing the
errors of two dependency parsers on the 13 different languages from the
CoNLL shared tasks.

Our universal POS tagset unifies this previous work and defines the
following twelve POS tags: NOUN (nouns), VERB (verbs), ADJ (adjectives),
ADV (adverbs), PRON (pronouns), DET (determiners and articles), ADP
(prepositions and postpositions), NUM (numerals), CONJ (conjunctions),
PRT (particles), '.' (punctuation marks) and X (a catch-all for other
categories such as abbreviations or foreign words).

We did not rely on intrinsic definitions of the above categories.
Instead, each category is defined operationally. For each treebank under
consideration, we studied the exact POS tag definitions and annotation
guidelines and created a mapping from the original treebank tagset to
these universal POS tags. Most of the decisions were fairly clear. For
example, from the PennTreebank, VB, VBD, VBG, VBN, VBP, VBZ and MD
(modal) were all mapped to VERB. A less clear case was the universal tag
for particles, PRT, which was mapped from POS (possessive), RP
(particle) and TO (the word 'to'). In particular, the TO tag is
ambiguous in the PennTreebank between infinitival markers and the
preposition 'to'. Thus, under this mapping, some prepositions will be
marked as particles in the universal tagset. Figure 1 gives an example
mapping for a sentence from the PennTreebank.
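
The operational definition described above amounts to a lookup table per
treebank. The following Python sketch illustrates the idea using only
the PennTreebank entries that are stated in the text or visible in
Figure 1; it is not the released mapping file. The names
PTB_TO_UNIVERSAL and to_universal, and the choice to send unlisted tags
to X, are assumptions made for this illustration.

```python
# The twelve universal tags listed in Section 2.
UNIVERSAL_TAGS = {
    "NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
    "ADP", "NUM", "CONJ", "PRT", ".", "X",
}

# Partial PennTreebank -> universal mapping. Only entries named in the
# text (verbs, particles) or shown in Figure 1 are included; the full
# tagset has 45 tags.
PTB_TO_UNIVERSAL = {
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
    "VBP": "VERB", "VBZ": "VERB", "MD": "VERB",
    "POS": "PRT", "RP": "PRT", "TO": "PRT",
    "DT": "DET", "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "JJ": "ADJ", "IN": "ADP",
}


def to_universal(fine_tags, mapping=PTB_TO_UNIVERSAL, fallback="X"):
    """Map fine-grained tags to universal tags.

    Tags missing from `mapping` fall back to X. This fallback is an
    assumption of the sketch, not a rule stated in the paper.
    """
    return [mapping.get(tag, fallback) for tag in fine_tags]


# The tag sequence of the Figure 1 sentence.
original = "DT NN NNP NNP VBZ VBN DT JJ NN IN DT NNS".split()
universal = to_universal(original)
print(" ".join(universal))
```

Because TO always maps to PRT in this table, a prepositional 'to' would
come out as a particle, which is the ambiguity noted above.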

Another case we had to consider is that some tag categories do not occur
in all languages. Consider for example the case of adjectives. While all
languages have a way of describing the properties of objects (which
themselves are typically referred to with nouns), many have argued that
Korean does not technically have adjectives, but instead expresses
properties of nouns via stative verbs (Kim, 2002). As a result, in our
mapping for Korean, we mapped stative verbs to the universal ADJ tag. In
other cases this was clearer, e.g. the Bulgarian treebank has no
category for determiners or articles. This is not to say that there are
no determiners in the Bulgarian language. However, since they are not
annotated as such in the treebank, we are not able to include them in
our mapping.

Language    Source                                               #Tags   O/O   U/U   O/U
Arabic      PADT/CoNLL07 (Hajic et al., 2004)                       21  96.1  96.9  97.0
Basque      Basque3LB/CoNLL07 (Aduriz et al., 2003)                 64  89.3  93.7  93.7
Bulgarian   BTB/CoNLL06 (Simov et al., 2002)                        54  95.7  97.5  97.8
Catalan     CESS-ECE/CoNLL07 (Marti et al., 2007)                   54  98.5  98.2  98.8
Chinese     Penn ChineseTreebank 6.0 (Palmer et al., 2007)          34  91.7  93.4  94.1
Chinese     Sinica/CoNLL07 (Chen et al., 2003)                     294  87.5  91.8  92.6
Czech       PDT/CoNLL07 (Bohmova et al., 2003)                      63  99.1  99.1  99.1
Danish      DDT/CoNLL06 (Kromann et al., 2003)                      25  96.2  96.4  96.9
Dutch       Alpino/CoNLL06 (Van der Beek et al., 2002)              12  93.0  95.0  95.0
English     PennTreebank (Marcus et al., 1993)                      45  96.7  96.8  97.7
French      FrenchTreebank (Abeille et al., 2003)                   30  96.6  96.7  97.3
German      Tiger/CoNLL06 (Brants et al., 2002)                     54  97.9  98.1  98.8
German      Negra (Skut et al., 1997)                               54  96.9  97.9  98.6
Greek       GDT/CoNLL07 (Prokopidis et al., 2005)                   38  97.2  97.5  97.8
Hungarian   Szeged/CoNLL07 (Csendes et al., 2005)                   43  94.5  95.6  95.8
Italian     ISST/CoNLL07 (Montemagni et al., 2003)                  28  94.9  95.8  95.8
Japanese    Verbmobil/CoNLL06 (Kawata and Bartels, 2000)            80  98.3  98.0  99.1
Japanese    Kyoto4.0 (Kurohashi and Nagao, 1997)                    42  97.4  98.7  99.3
Korean      Sejong (http://www.sejong.or.kr)                       187  96.5  97.5  98.4
Portuguese  Floresta Sinta(c)tica/CoNLL06 (Afonso et al., 2002)     22  96.9  96.8  97.4
Russian     SynTagRus-RNC (Boguslavsky et al., 2002)                11  96.8  96.8  96.8
Slovene     SDT/CoNLL06 (Dzeroski et al., 2006)                     29  94.7  94.6  95.3
Spanish     Ancora-Cast3LB/CoNLL06 (Civit and Marti, 2004)          47  96.3  96.3  96.9
Swedish     Talbanken05/CoNLL06 (Nivre et al., 2006)                41  93.6  94.7  95.1
Turkish     METU-Sabanci/CoNLL07 (Oflazer et al., 2003)             31  87.5  89.1  90.2

Table 1: Data sets, number of language specific tags in the original
treebank, and tagging accuracies for training/testing on the original
(O) and the universal (U) tagset. Where applicable, we indicate whether
the data set was extracted from the CoNLL 2006 (Buchholz and Marsi,
2006) or CoNLL 2007 (Nivre et al., 2007) versions of the corpora.

The list of treebanks for which we have constructed mappings can be seen
in Table 1. One main objective in publicly releasing this resource is to
provide treebank and language specific experts a mechanism for refining
these categories and the decisions we have made, as well as adding new
treebanks and languages. This resource is therefore hosted as an open
source project with version control.

3 Experiments 

To demonstrate the efficacy of the proposed universal POS tagset, we
performed two sets of experiments. First, to provide a language
comparison, we trained the same supervised POS tagging model on all of
the above treebanks and evaluated the tagging accuracy on the universal
POS tagset. Second, we used universal POS tags (automatically projected
from English) as the starting point for unsupervised grammar induction,
producing completely unsupervised parsers for several languages.

3.1 Language Comparisons 

To compare POS tagging accuracies across different languages we trained
a supervised tagger based on a trigram Markov model (Brants, 2000) on
all treebanks. We chose this model for its speed and (close to)
state-of-the-art accuracy without language specific tuning.^2

Table 1 shows the results for all 25 treebanks when training/testing on
the original (O) and universal (U) tagsets. Overall, the variance on the
universal tagset has been reduced by half (5.1 instead of 10.4). But of
course there are still accuracy differences across the different
languages. On the one hand, given a gold segmentation, tagging Japanese
is almost deterministic, resulting in a final accuracy of above 99%.^3
On the other hand, tagging Turkish, an agglutinative language with an
average sentence length of 11.6 tokens, remains very challenging,
resulting in an accuracy of only 90.2%.
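
The variance comparison can be approximated from the published table. The
sketch below copies the O/O and U/U columns of Table 1 and computes their
sample variance with Python's standard library. Two things are
assumptions here: that the reported figures compare these two columns,
and that a sample (rather than population) variance was used. Since the
table entries are rounded, the result should be close to, but need not
exactly match, the reported 10.4 and 5.1.

```python
from statistics import mean, variance

# (O/O, U/U) accuracies copied from Table 1, in row order.
ACCURACIES = [
    (96.1, 96.9), (89.3, 93.7), (95.7, 97.5), (98.5, 98.2), (91.7, 93.4),
    (87.5, 91.8), (99.1, 99.1), (96.2, 96.4), (93.0, 95.0), (96.7, 96.8),
    (96.6, 96.7), (97.9, 98.1), (96.9, 97.9), (97.2, 97.5), (94.5, 95.6),
    (94.9, 95.8), (98.3, 98.0), (97.4, 98.7), (96.5, 97.5), (96.9, 96.8),
    (96.8, 96.8), (94.7, 94.6), (96.3, 96.3), (93.6, 94.7), (87.5, 89.1),
]

orig = [o for o, _ in ACCURACIES]
univ = [u for _, u in ACCURACIES]

# statistics.variance is the sample variance (divides by n - 1).
var_orig = variance(orig)
var_univ = variance(univ)

print(f"O/O: mean={mean(orig):.1f}  variance={var_orig:.1f}")
print(f"U/U: mean={mean(univ):.1f}  variance={var_univ:.1f}")
print(f"ratio U/U to O/O: {var_univ / var_orig:.2f}")
```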

It should be noted that the best results are obtained by training on the
original treebank categories and mapping the predictions to the
universal POS tags at the end (O/U column). This is because the
transition model based on the universal POS tagset is less informative.
An interesting experiment would be to train the latent variable tagger
of Huang et al. (2009) on this tagset. Their model automatically
discovers refinements of the observed categories and could potentially
find a tighter fit to the data than the one provided by the original,
linguistically motivated treebank tags.
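
As an illustration of how the O/U setting is scored, the sketch below
maps both the gold tags and a tagger's fine-grained predictions to the
universal tagset before comparing them. The predicted sequence is a
made-up example rather than output from the tagger used in the paper,
and the helper names are assumptions. The mapping entries are the
PennTreebank ones mentioned in Section 2 and Figure 1.

```python
# Small PennTreebank -> universal mapping (entries from Section 2 / Figure 1).
FINE_TO_UNIVERSAL = {
    "VBD": "VERB", "VBN": "VERB",
    "NN": "NOUN", "NNS": "NOUN",
    "DT": "DET", "JJ": "ADJ",
}


def accuracy(predicted, gold):
    """Token-level accuracy of two equally long tag sequences."""
    if len(predicted) != len(gold):
        raise ValueError("sequences must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def map_tags(tags):
    """Project fine-grained tags onto the universal tagset."""
    return [FINE_TO_UNIVERSAL[tag] for tag in tags]


gold_fine = ["DT", "NN", "VBN", "DT", "JJ", "NNS"]
pred_fine = ["DT", "NN", "VBD", "DT", "NN", "NNS"]  # hypothetical tagger output

acc_oo = accuracy(pred_fine, gold_fine)                      # scored on original tags
acc_ou = accuracy(map_tags(pred_fine), map_tags(gold_fine))  # scored on universal tags

print(f"O/O accuracy: {acc_oo:.3f}")
print(f"O/U accuracy: {acc_ou:.3f}")
```

In this toy case the VBD/VBN confusion disappears after mapping because
both tags are VERB, while the NN/JJ confusion remains. This shows only
the scoring step; it does not by itself demonstrate the advantage of O/U
over U/U, which the text attributes to the more informative transition
model.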

3.2 Grammar Induction 

We further demonstrate the utility of the universal POS tags in a
grammar induction experiment. To decouple the challenges of POS tagging
and parsing, gold POS tags are typically assumed in unsupervised grammar
induction experiments (Carroll and Charniak, 1992; Klein and Manning,
2004).^4 We propose to remove this unrealistic simplification by using
POS tags automatically projected from English as the basis of a grammar
induction model.

Das and Petrov (2011) describe a cross-lingual projection framework to
learn POS taggers without labeled data for the language of interest. We
use their automatically induced POS tags to induce syntactic
dependencies. To this end, we chose the framework of Naseem et al.
(2010), in which a few universal syntactic rules (USR) are used to
constrain a probabilistic Bayesian model. These rules are specified
using a set of universal syntactic categories, and lead to
state-of-the-art grammar induction performance superior to previous
methods, such as the dependency model with valence (DMV) (Klein and
Manning, 2004) and the phylogenetic grammar induction model (PGI)
(Berg-Kirkpatrick and Klein, 2010).

^2 Trained on the English PennTreebank this model achieves 96.7%
accuracy when evaluated on the original 45 POS tags.

^3 Note that the accuracy on the universal POS tags for the two Japanese
treebanks is almost the same.

^4 A less benevolent explanation for this practice is that grammar
induction from plain text is simply still too difficult.

Language    DMV   PGI   USR-G  USR-I
Danish      33.5  41.6  55.1   41.7
Dutch       37.1  45.1  44.0   38.8
German      35.7  -^5   60.0   55.1
Greek       39.9  -^5   60.3   53.4
Italian     41.1  -^5   47.9   41.4
Portuguese  38.5  63.0  70.9   66.4
Spanish     28.0  58.4  68.3   43.3
Swedish     45.3  58.3  52.6   59.4

Table 2: Grammar induction results in terms of directed dependency
accuracy. DMV and PGI use fine-grained gold POS tags, while USR-G and
USR-I use gold and automatically projected universal POS tags,
respectively.

In their experiments, Naseem et al. also used a set of universal
categories, though with some differences from the tagset presented here.
Their tagset does not have punctuation and catch-all categories, but
includes a category for auxiliaries. The auxiliary category helps define
a syntactic rule that attaches verbs to an auxiliary head, which is
beneficial for certain languages. However, since this rule is reversed
for other languages, we omit it in our tagset. Additionally, they also
used refined categories in the form of CoNLL treebank tags. In our
experiments, we did not make use of refined categories, as the POS tags
induced by Das and Petrov (2011) were all coarse.

We present results on the same eight Indo-European languages as Das and
Petrov (2011), so that we can make use of their automatically projected
POS tags. For all languages, we used the treebanks released as a part of
the CoNLL-X (Buchholz and Marsi, 2006) shared task. We only considered
sentences of length 10 or less, after the removal of punctuation. We
performed Bayesian inference on the whole treebank and report dependency
attachment accuracy.

^5 Not reported by Berg-Kirkpatrick and Klein (2010).
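
The data selection and the evaluation metric can be sketched as follows.
This is a minimal illustration, not the evaluation code used in the
experiments. It assumes that punctuation is identified by the universal
'.' tag and that heads are given as 1-based token indices with 0 for the
root. The function names are hypothetical, and re-attaching dependents
of removed punctuation tokens is not handled.

```python
PUNCT_TAG = "."  # universal tag for punctuation marks


def keep_sentence(universal_tags, max_len=10):
    """True if the sentence has at most `max_len` tokens once
    punctuation tokens are dropped."""
    return sum(tag != PUNCT_TAG for tag in universal_tags) <= max_len


def directed_accuracy(gold_heads, predicted_heads):
    """Fraction of tokens, over all sentences, whose predicted head
    index equals the gold head index (direction matters)."""
    total = correct = 0
    for gold, pred in zip(gold_heads, predicted_heads):
        if len(gold) != len(pred):
            raise ValueError("sentence length mismatch")
        total += len(gold)
        correct += sum(g == p for g, p in zip(gold, pred))
    return correct / total if total else 0.0


short = ["DET", "NOUN", "VERB", "."]  # 3 tokens after removing "."
long_ = "DET NOUN NOUN NOUN VERB VERB DET ADJ NOUN ADP DET NOUN".split()
kept = [s for s in (short, long_) if keep_sentence(s)]

gold = [[2, 3, 0]]  # DET <- NOUN <- VERB (root)
pred = [[3, 3, 0]]  # DET wrongly attached to the verb
score = directed_accuracy(gold, pred)
print(len(kept), f"{score:.3f}")
```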

Table 2 shows directed dependency accuracies for the DMV and PGI models
using fine-grained gold POS tags. For the USR model, it reports results
on gold universal POS tags (USR-G) and automatically induced universal
POS tags (USR-I). The USR-I model falls short of the USR-G model, but
has the advantage that it does not require any labeled data from the
target language. Quite impressively, it does better than DMV for all
languages, and is competitive with PGI, even though those models have
access to fine-grained gold POS tags.
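
The comparison with DMV and PGI can be checked directly against the
numbers in Table 2. The short sketch below copies the table, using None
where PGI results were not reported, and computes the per-language
differences. The variable names are assumptions for this illustration.

```python
# (DMV, PGI, USR-G, USR-I) directed dependency accuracies from Table 2.
# None marks PGI results that were not reported.
RESULTS = {
    "Danish":     (33.5, 41.6, 55.1, 41.7),
    "Dutch":      (37.1, 45.1, 44.0, 38.8),
    "German":     (35.7, None, 60.0, 55.1),
    "Greek":      (39.9, None, 60.3, 53.4),
    "Italian":    (41.1, None, 47.9, 41.4),
    "Portuguese": (38.5, 63.0, 70.9, 66.4),
    "Spanish":    (28.0, 58.4, 68.3, 43.3),
    "Swedish":    (45.3, 58.3, 52.6, 59.4),
}

# USR-I versus DMV on every language.
beats_dmv = all(usr_i > dmv for dmv, _, _, usr_i in RESULTS.values())

# USR-I minus PGI, only where PGI was reported.
pgi_gaps = {
    lang: usr_i - pgi
    for lang, (_, pgi, _, usr_i) in RESULTS.items()
    if pgi is not None
}

print("USR-I beats DMV everywhere:", beats_dmv)
for lang, gap in pgi_gaps.items():
    print(f"{lang:<11} USR-I - PGI = {gap:+.1f}")
```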

4 Conclusions 

We proposed a POS tagset consisting of twelve categories that exist
across languages and developed a mapping from 25 language specific
tagsets to this universal set. We demonstrated experimentally that the
universal POS categories generalize well across language boundaries on
an unsupervised grammar induction task, giving competitive parsing
accuracies without relying on gold POS tags. The tagset and mappings are
available for download at http://code.google.com/p/universal-pos-tags/